ABSTRACT


The  use  of  phase-only  representations  of  speech  for 
isolated  word  recognition  is  explored.  Until  recently  the 
ear  was  thought  to  be  short-term  phase  insensitive. 

However,  short-term  phase-only  reconstructed  speech  has  been 
shown  to  retain  much  of  the  intelligibility  of  the  original 
signal. Using cepstral and analytic signal processing
techniques, a system for isolated word recognition is
developed. The results of tests for both the speaker-dependent
and speaker-independent cases indicate that phase
may be an important feature to consider in the development
of word recognition systems.
I. INTRODUCTION


As  the  complexity  of  man’s  machines  increases,  so  does 
the  need  for  simple,  efficient  man-machine  interfaces. 
Automatic  speech  recognition  plays  a  major  role  in  this  man- 
machine  communication  because  of  the  superiority  of  speech 
over  other  modes  of  human  communication.  Speech  is  the  most 
familiar  and  most  convenient  way  for  humans  to  communicate. 
Voice  input  leaves  the  hands  and  eyes  of  the  operator  free 
to  perform  other  tasks  and  allows  speaker  mobility. 

Word  recognition  is  one  facet  of  the  research  conducted 
in  the  area  of  speech  processing.  Speech  processing  can  be 
divided  into  three  major  categories.  The  speech  analysis 
area  includes  word  recognition,  speaker  identification,  and 
speaker  verification.  The  second  category  is  speech 
synthesis. An example of synthesis is a data-retrieval
system,  where  the  computer  responds  verbally  when  its  data 
base  is  interrogated.  Another  example  is  when  a  child 
receives  a  verbal  response  from  his  toy  informing  him  he  has 
correctly  answered  a  question.  The  third  area  is  a 
combination  of  the  first  two,  speech  analysis  followed  by 
speech  synthesis.  This  has  application  in  secure  voice 
transmission and speech data rate reduction. As an example
of the latter, the telephone company requires 64 kbits/sec
to transmit speech. The Department of Defense standard for
data rate reduction is 2.4 kbits/sec. The Air Force is
experimenting with data rates as low as 150 bits/sec that
still provide intelligible speech.

The  advent  of  the  general  purpose  digital  computer 
in  the  mid-1960s  provided  speech  researchers  with  a 
powerful  tool.  Numerous  speech  processing  algorithms 
using digital signal processing techniques have been developed
for both analysis and synthesis. From the use of dynamic
programming to time-warp speech prior to processing, to
algorithms for extracting parameters to be used for speech
synthesis, speech processing is a billion-dollar-a-year
business.

Various  speaker-dependent  word  recognition  systems  are 
commercially  available.  These  systems  generally  perform 
some type of spectral analysis on the incoming speech
signal.  The  recognition  process  involves  classical  pattern 
recognition  techniques.  These  systems  have  a  very  high  rate 
of  successful  recognition. 

The  success  of  these  systems  notwithstanding,  the 
problem  of  constructing  a  speaker-independent  recognition 
system  remains  unsolved.  The  solution  to  this  problem 
involves  determining  what  features  of  speech  contain  the 
information  and  hence  are  speaker  independent.  Before  one 
can  talk  about  extracting  the  information  content  from  the 
speech signal, a look at a model of how humans produce
speech  is  in  order. 

A.  FUNDAMENTALS  OF  SPEECH 

Flanagan  [Refs.  1  and  2]  formulated  a  generally  accepted 
model  for  human  speech  production.  His  model  describes  the 
vocal  tract  as  a  nonuniform  acoustic  tube  connecting  the 
vocal  cords  and  the  lips.  In  an  adult  male  the  vocal  tract 
is  approximately  17  cm.  in  length. 

The  vocal  tract  can  be  connected  to  an  ancillary  cavity 
called  the  nasal  cavity.  The  coupling  is  accomplished 
through  a  trapdoor  mechanism  called  the  velum.  The  nasal 
cavity begins at the velum and terminates at the nostrils.
In an adult it is about 12 cm. long. When non-nasal sounds
are produced the velum closes, thereby sealing off the nasal
cavity.

Humans  are  capable  of  producing  two  types  of  sounds, 
voiced  and  unvoiced.  In  the  case  of  voiced  sounds  air  moves 
over  the  vocal  cords  causing  them  to  vibrate  in  a  quasi- 
periodic  fashion.  Unvoiced  sounds  are  generated  by  either 
forming  a  constriction  in  the  tract  and  forcing  the  air 
through  at  high  velocity  or  by  allowing  pressure  to  build  up 
behind  the  closure  and  then  releasing  it  suddenly.  The  name 
fricative  is  associated  with  the  former  while  plosive  is  the 
name  given  to  the  latter. 


Since the physical configuration of the vocal tract
changes with time, Flanagan's model can be represented as a
linear time-varying system as shown in Figure 1.1.


Figure 1.1. Model of Speech Production (excitation x(t) applied to a time-varying filter with impulse response v(t), producing the output y(t))

If  it  is  assumed  that  the  vocal  tract  changes  slowly 
with  time  the  output  can  be  approximated  by  the  short-term 
convolution  of  the  excitation,  x(t),  and  the  vocal  tract 
impulse response, v(t). For voiced sounds x(t) is
quasiperiodic; hence the output y(t) is also quasiperiodic.
For  the  unvoiced  case  the  excitation  x(t)  is  random  and  is 
generally  approximated  by  white  noise. 
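The short-term convolution model can be illustrated with a minimal Python sketch. The sampling rate, the pitch period, and the single damped resonance standing in for v(t) are illustrative assumptions and are not values taken from the text.

```python
import numpy as np

fs = 8000                          # sampling rate in Hz (assumed)
n = np.arange(400)                 # 50 ms of samples at 8 kHz

# Hypothetical vocal tract impulse response v: one damped resonance near 500 Hz.
k = np.arange(200)
v = (0.97 ** k) * np.sin(2 * np.pi * 500 * k / fs)

# Voiced excitation: quasi-periodic pulses, idealized here as an impulse train.
pitch_period = 80                  # samples, i.e. a 100 Hz pitch (assumed)
x_voiced = np.zeros(len(n))
x_voiced[::pitch_period] = 1.0

# Unvoiced excitation: white noise.
rng = np.random.default_rng(0)
x_unvoiced = rng.standard_normal(len(n))

# Output y = x * v (convolution), truncated to the frame length.
y_voiced = np.convolve(x_voiced, v)[:len(n)]
y_unvoiced = np.convolve(x_unvoiced, v)[:len(n)]
```

Once the initial transient has passed, y_voiced repeats every pitch_period samples, while y_unvoiced shows no such periodicity.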

If the vocal tract impulse response of an individual
could be obtained, then the time-varying linear system
model should make it possible to generate intelligible
speech. The excitation would be either periodic or
random depending on whether voiced or unvoiced sounds are
desired. Figure 1.2 is a simplified speech synthesis
machine where the vocal tract parameters are stored in the
RAM and downloaded to the voice synthesis chip, which is
excited by either the periodic or the random signal. This
type of speech synthesis arrangement is the basis for Texas
Instruments' (TI) Speak and Spell toys. TI can custom
manufacture a speech synthesis chip which will emulate
anyone's voice for $15,000.

Figure 1.2. Voice Synthesis


These  voiced  and  unvoiced  sounds  are  combined  in  a 
unique  fashion  to  form  phonemes,  the  basic  building  blocks 
of  language.  All  languages  can  be  reduced  to  a  finite 
number  of  these  distinguishable  building  blocks.  Phonemes 
are  of  such  fundamental  importance  that  if  one  phoneme  is 
exchanged  for  another  the  meaning  of  an  utterance  is 
completely  altered. 

Thus,  in  theory,  if  a  machine  could  be  designed  to 
disassemble  utterances  into  their  phoneme  components  the 
speech  recognition  problem  would  be  completely  solved. 
Despite vast amounts of time, effort, and money expended,
however, the phoneme disassembler appears to be years away
from becoming a reality.

B.  SPEECH  RECOGNITION  MACHINES 

While  the  phoneme  disassembler  does  not  exist,  several 
types  of  speech  recognition  systems  are  commercially 
available.  The  majority  of  these  systems  are  classified  as 
isolated  word  recognizers.  As  the  name  implies  the  systems 
are  designed  to  recognize  isolated  words.  The  vocabulary  of 
these  machines  is  usually  limited  to  100-300  words  and  these 
systems  are  extremely  speaker  dependent.  Thus,  a  person 
desiring  to  use  these  machines  must  first  train  the  machine 
to  recognize  his  voice.  During  the  training  phase  the 
speaker's  utterances  are  processed  and  templates  formed. 

The recognition process involves comparing the incoming
utterance with those templates stored in the machine's
memory [Ref. 3]. Although these machines have a limited
vocabulary  and  cannot  recognize  connected  or  conversational 
speech,  they  are  extremely  useful  for  inventory  control, 
quality  assurance  control,  or  for  a  pilot  to  check  the 
systems  in  a  combat  aircraft.  In  all  these  instances  the 
vocabulary  is  limited,  the  speaker  is  known,  and  voice  data 
entry  frees  the  individual  to  perform  other  tasks. 

ITT  has  developed  a  word  recognition  system  for  the  Air 
Force's  F-16  fighter.  The  system  is  capable  of  recognizing 
300  words  and  allows  the  pilot  to  check  the  status  of 
certain systems while he maintains two-hand control of the
plane. This two-hand control is particularly important
during low-level, high-speed attack runs. The pilots
update their voice patterns monthly, or whenever their voices
change due, say, to a cold. The patterns are stored in a
bubble  memory  and  inserted  into  the  system  prior  to 
take-off.  The  microphone  is  located  inside  the  pilot's 
oxygen  mask  and  the  system  status  is  displayed  on  the 
cockpit's  CRT.  At  a  recent  demonstration  of  this  system  it 
had  a  correct  recognition  rate  of  99%. 

The NPS Speech Processing Laboratory acquired an isolated
word recognition system for experimentation purposes.
The system is the VRM Voterm-2 manufactured by Interstate
Electronics Corporation. The system, acquired in 1981,
weighs 10 lbs. and cost $2500. Today the same system has
been reduced to a four-chip set, for a cost of $1000.

The  operation  of  the  VRM  is  typical  of  the  word 
recognition systems currently available [Ref. 4]. It allows
the  user  to  select  the  vocabulary  size,  decision  threshold 
and  number  of  training  passes.  It  also  allows  for  reference 
pattern  transfer  between  itself  and  the  host  computer.  The 
host  computer  serves  only  as  a  mass  storage  device  and 
controller. All processing and recognition are performed
in real time by the VRM.

The  input  speech  signal  is  analyzed  by  a  16-filter 
analog  spectrum  analyzer  and  then  passed  through  an  A/D 
converter.  This  digitized  speech  data  is  then  converted  to 
a fixed-size (120-bit) pattern that preserves the information
content of the utterance. During the training phase
the VRM rejects utterances that do not sufficiently agree
with previous training samples of the word. This rejection
leads to a reduction of the number of 'ones' stored in the
pattern. After seven training passes the pattern contains
approximately one hundred 'zeroes'.

In  1980,  NATO  and  the  Rome  Air  Development  Center  (RADC) 
[Ref.  5]  conducted  a  comparison  test  on  three  isolated  word 
recognition  systems.  The  vocabulary  used  consisted  of  the 
ten  single  digits  of  the  respective  languages  of  the 
speakers.  The  machines  evaluated  were  the  VRM  system,  the 
Threshold Technology 8040 Preprocessor (cost $50,000) and
the  Nippon  Electric  DP-100  (cost  $60,000). 

Table  1.1  lists  the  results  from  the  RADC  test  [Ref.  6]. 
Each  speaker  trained  the  machines  by  repeating  each  digit 
ten  times.  No  attempt  was  made  to  introduce  speakers  who 
had not trained the machine. However, in tests run at the
Speech Processing Lab with the VRM, using the ten digits and
three sets of reference patterns, the successful recognition
rate for speakers who had not trained the machine was less
than 30%.

Thus,  these  systems  work  extremely  well  for  what  they 
were  designed  to  accomplish.  As  previously  stated,  the 
basic  question  of  what  parameters  of  speech  are  speaker 
independent  still  remains  unanswered.  Numerous  theories 
have  been  proposed  and  all  have  been  unsuccessful.  There  is 
a  lack  of  understanding  of  the  human  mechanisms  used  in 
understanding  speech. 


II.  MODELS  OF  THE  EAR 


For  a  long  time  people  have  been  trying  to  understand 
how the human ear functions. In the first century B.C., the
Roman poet and philosopher Lucretius postulated a model
"involving little grains of sand in the inner ear responding
to different tones" [Ref. 7]. The 18th century Italian
violinist  Tartini  noted  that  the  ear  produced  a  third  tone 
from  two  tones  played  simultaneously.  Thus  the  long  held 
belief  that  the  ear  was  a  linear  device  was  demonstrated  to 
be  false.  Today  the  ear  is  thought  to  be  a  nonlinear  device 
even  at  power  levels  near  the  threshold  of  hearing. 

The  first  concentrated  research  into  the  process  of 
hearing did not begin until the mid-1800s. This was the
time of Seebeck, Helmholtz, and Ohm. It was Ohm who
postulated a now famous law on the relationship of speech
and its phase angle. He stated that all the information
content of speech is contained in its power spectrum and is
independent of the phase angle of the components. Although
Ohm's law has been modified in recent years, it remains
one of the fundamental laws of psychoacoustics.

The ear can be broken down into three physical areas:
the outer, middle and inner ear. Sound waves impinge on the
outer ear and are conducted down a canal until they reach
the middle ear. The middle ear contains three tiny bones.
The alternate compressions and rarefactions of the speech
wave cause the eardrum to strike the bones. In the inner
ear the wave travels along a thin membrane whose frequency
response varies between 100 Hz and 20 kHz. This provides
for spectral analysis of the incoming signal.

The membrane of the inner ear is lined with tiny hairs.
It is these hairs, or more correctly groups of hairs, that
perform the spectral analysis. Recent studies at the
California Institute of Technology [Ref. 8] have found that
each tiny hair bundle consists of 30-150 thin, rod-shaped
extensions called cilia. These hair bundles are attached to
hair cells. The hair cells are very sensitive transducers
which convert the movement of the hair bundle into an electrical
signal which is sent to the brain. The hair bundle-hair
cell combination forms a sort of mechanical spectrum
analyzer.

Manfred  Schroeder  [Ref.  9]  describes  an  experiment  in 
which  the  inner  ear’s  sensitivity  to  phase  was  demonstrated. 
The  experiment  was  as  follows: 

1)  A  100  sec.  sample  of  speech  was  Fourier  transformed. 

2)  Random  phase  angles  were  assigned  to  the  frequency 
components (assuming a uniform distribution from 0 to 2π).

3)  The  inverse  Fourier  transform  was  taken. 

The resultant signal sounded like white noise. Thus by
randomizing the phase angles the signal was transformed from
intelligible speech to noise. This lent credence to the
hypothesis that the inner ear was phase sensitive and that
Ohm's law, if not wrong, was at least in need of modification.
The experiment was repeated, this time using a 50
msec sample of speech. The resultant signal was non-intelligible
noise. Ohm's law, modified to say that only the
short-term amplitude spectrum contained the speech
information, appeared to be correct.
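The three steps of the phase randomization experiment can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, a real-input FFT is used so that the result stays real valued, and the frame length for the short-term variant is left to the caller.

```python
import numpy as np

def randomize_phase(signal, rng):
    """Keep the amplitude spectrum of a real signal but assign random phase angles."""
    spectrum = np.fft.rfft(signal)                       # step 1: Fourier transform
    phase = rng.uniform(0.0, 2.0 * np.pi, size=spectrum.shape)  # step 2: uniform on [0, 2*pi)
    phase[0] = 0.0                                       # keep the DC bin real
    if len(signal) % 2 == 0:
        phase[-1] = 0.0                                  # keep the Nyquist bin real
    new_spectrum = np.abs(spectrum) * np.exp(1j * phase)
    return np.fft.irfft(new_spectrum, n=len(signal))     # step 3: inverse transform

def randomize_phase_short_term(signal, frame_len, rng):
    """Apply randomize_phase independently to successive non-overlapping frames."""
    frames = [randomize_phase(signal[i:i + frame_len], rng)
              for i in range(0, len(signal), frame_len)]
    return np.concatenate(frames)
```

For example, randomize_phase(x, np.random.default_rng(0)) returns a signal with the same amplitude spectrum as x but a different waveform.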

Ohm  based  his  law  on  a  model  of  the  ear  that  said: 

1) The ear has a set of tuned bandpass filters covering the
audio range.

2) Only the output amplitude of each filter is sent to
the brain.

Today the most likely candidates for the bandpass filters are
the hair bundle-hair cell combinations that respond to only
selected stimuli.

In 1947 an experiment was conducted [Ref. 10] in an
effort to obtain a definite answer to the phase sensitivity
question. A 2000 Hz carrier was amplitude modulated by a 100 Hz
signal. Thus three frequency components (1900 Hz, 2000 Hz,
2100 Hz) were present. One of the sidebands had its phase
shifted by 180°. This phase shift resulted in what was
termed a quasi-FM (QFM) signal. Upon listening to the
signals there was a noticeable difference between the AM and
QFM signal. Thus there was a revived interest in the ear's
capability to discern waveforms and not just their
amplitude.
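The difference between the AM and QFM signals can be made concrete with a short Python sketch. The sampling rate, duration, and modulation index are assumptions chosen for illustration; only the three component frequencies and the 180° sideband shift come from the description above.

```python
import numpy as np

fs = 16000                          # sampling rate in Hz (assumed)
t = np.arange(1600) / fs            # 0.1 s, so all three components fall on exact DFT bins
fc, fm, m = 2000.0, 100.0, 0.5      # carrier, modulating frequency, modulation index (m assumed)

carrier = np.cos(2 * np.pi * fc * t)
lower = 0.5 * m * np.cos(2 * np.pi * (fc - fm) * t)   # 1900 Hz sideband
upper = 0.5 * m * np.cos(2 * np.pi * (fc + fm) * t)   # 2100 Hz sideband

am = carrier + lower + upper        # ordinary AM
qfm = carrier + lower - upper       # upper sideband shifted by 180 degrees

# The amplitude spectra are identical; only the phase of one component differs.
am_mag = np.abs(np.fft.rfft(am))
qfm_mag = np.abs(np.fft.rfft(qfm))

# The waveforms differ: the AM envelope swings between 1 - m and 1 + m,
# while the QFM envelope stays close to constant.
am_peak = np.max(np.abs(am))
qfm_peak = np.max(np.abs(qfm))
```

An ear that responded only to the amplitude of each component could not tell these two signals apart, which is why an audible difference is evidence of phase sensitivity.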

In a further effort to determine to what extent phase is
important in discerning speech, Hall and Schroeder [Ref. 11]
conducted an experiment where the phase angle of one of two
pure tones was changed. Specifically, two tones, one at 200
Hz with a phase of 0° and another at 400 Hz with phase angles of 0°,
60°, 120°, 180°, 240°, and 300°, were listened to, three
signals at a time. The listeners' task was to determine
which two signals sounded most alike and which two sounded
least alike. The results showed that those signals whose
400 Hz components differed the least in phase angle were
consistently judged to be the most similar.

About  twelve  years  prior  to  this  experiment  researchers 
at  Bell  Labs  postulated  that  the  phase  dependency  seen  in 
experiments  involving  the  inner  ear  could  be  traced  to  the 
phase  dependence  of  the  inner  and  middle  ear  distortion 
products. Due to the presence of these nonlinear distortion
products a new spectrum, called the inner spectrum, was
formed in the inner ear. It is this spectrum that is
analyzed by the hair bundles of the inner ear.

This theory certainly would explain what happened at
Bell Labs during a 1958 experiment [Ref. 12]. When the
phase of one of 31 equal-amplitude harmonics, all at 0° phase,
was changed to 180°, a pure tone was heard. This tone was
not heard when the signal was put through a loudspeaker.
Thus, using the inner spectrum theory, changing the phase of
one harmonic to 180° altered the amplitude of one of the
distortion products. This altered the inner spectrum,
causing a bump in the spectrum where previously it had been
flat.

In Germany, Terhardt and Fastl [Ref. 13] conducted
experiments trying to connect frequency difference and phase
angles. They formed a signal
$s(t) = a_1 \cos(2\pi f_1 t) + a_2 \cos(2\pi f_2 t - \phi_2)$,
where $f_1$ = 200 Hz and $f_2$ = 400 Hz, and asked listeners
to adjust the amplitude of each component so the
400 Hz tone was just audible. This was to be done while the
phase angle, $\phi_2$, of the 400 Hz tone was changed. The
results showed that when $\phi_2$ was changed from 0° to 180°, the
amplitude of the 400 Hz signal had to be increased by 12 dB
to remain audible.

Yet another theory on the functioning of the ear came out
of this experiment. The researchers theorized that the hair
cells of the ear were discerning the time between successive
spikes in the waveform and passed this information to the
brain. This appeared to be a reasonable explanation, since when
$\phi_2$ = 0° the time between successive spikes was 2.5 msec. With
$\phi_2$ = 180° the time between spikes was 5 msec, unless the
amplitude of the 400 Hz tone was increased by a considerable
amount. With the amplitude increased, the small spikes at the
2.5 msec mark would increase dramatically.

This  theory  is  consistent  with  the  physiology  of  the 
ear.  All  the  electric  pulses  transmitted  to  the  brain  from 
the  hair  cells  have  approximately  the  same  amplitude,  thus 
the  timing  between  the  pulses  is  the  information  that  they 
carry. 

From the myriad of theories presented it is easy to
conclude that a definitive model of the human ear is nonexistent.
The fact that phase contains some information
content has been demonstrated. Whether phase alone is the
speaker-independent feature that researchers are looking for
remains an unanswered question. Experiments conducted in
the late 1970s and 1980s using phase-only representations
of speech have given some credibility to the hypothesis
that phase must be included as one of the speaker-independent
features of speech.
III. PHASE-ONLY REPRESENTATIONS OF SPEECH


Recapitulating,  Ohm’s  law  stated  that  all  the 
information  content  of  speech  could  be  obtained  from  the 
short  term  power  spectrum  and  that  phase  angle  of  the 
components  was  meaningless.  Thus,  in  the  short  term  the  ear 
is  phase  deaf.  Oppenheim  [Ref.  14]  sought  to  explore  more 
fully  the  importance  of  phase  in  speech. 

Given the Fourier transform of a speech signal

$F(\omega) = |F(\omega)|\, e^{j\theta(\omega)}$  (3.1)

and if $|F(\omega)|$ is set equal to one, the inverse transform
of $e^{j\theta(\omega)}$ is a phase-only representation of the speech.

This phase-only representation retained total intelligibility,
while exhibiting the characteristics of being high-pass
filtered and having white noise added. The magnitude-only
representation was speech-like in its appearance but
was not intelligible.

Oppenheim concluded that transforming a signal to its
phase-only form was equivalent to passing it through a
spectral whitening process with a filter whose response is
$H(\omega) = 1/|F(\omega)|$, where $F(\omega)$ is the Fourier transform of the
original signal. This spectral whitening did not destroy
the intelligibility of the speech.
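The equivalence between the phase-only form and spectral whitening can be checked numerically. The following Python sketch is a minimal illustration; the function name and the small constant guarding against division by zero are assumptions, not part of the original work.

```python
import numpy as np

def phase_only(signal, eps=1e-12):
    """Inverse transform of exp(j*theta): the transform F with |F| set to one."""
    F = np.fft.fft(signal)
    H = 1.0 / np.maximum(np.abs(F), eps)    # whitening filter H = 1/|F|
    # H * F equals exp(j*theta); for a real input the inverse transform is real
    # up to rounding error, so the imaginary part is discarded.
    return np.real(np.fft.ifft(H * F))
```

The transform of the result has unit magnitude at every frequency and the same phase as the transform of the input, which is the whitening behavior described above.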


Challenging Ohm's law, Cox and Robinson [Ref. 15]
conducted a series of four experiments which preserved the
short-term phase of a speech signal while either destroying
or severely distorting the amplitude. These phase-only
signals were found to retain many speech characteristics and
were intelligible to the listeners. Hence, under certain
transformations, short-term phase may be one of the physical
invariants of speech.

The experiments used a speech signal that was analog
band-limited to 8 kHz and sampled at a rate of 20 kHz with
12-bit A/D conversion. Successive 25.6 msec windows, corresponding to
512 data points, were fast Fourier transformed. Nonlinear
operations were applied to each data set, and the inverse
fast Fourier transforms were taken, yielding 25.6 msec of
reconstructed speech signal. These signals were D/A
converted at a rate of 20 kHz and passed through an 8 kHz low-pass
analog filter. Only rectangular windows were used and
no attempt was made to fit the windows together, since the
amplitude of the reconstructed signal was unimportant. The
first two experiments are included for completeness only.
The latter two are the concern of this thesis.

A.  SHORT-TERM  PHASE  ONLY  SIGNALS 

This  experiment  basically  repeated  the  previously 
mentioned  work  of  Oppenheim,  as  the  magnitude  of  the  Fourier 
transform of the data sets was set equal to one. The phase
was unchanged. The reconstructed short-term phase-only
signal  was  found  to  retain  many  of  the  original  waveform's 
features.  Listeners  could  identify  speaker  dependent 
characteristics  and  the  intelligibility,  while  not  judged 
good,  was  likened  to  a  signal  containing  a  lot  of  noise. 
There  was  no  attempt  made  by  the  researchers  to  clean  up  the 
signal.  The  results  of  this  experiment  clearly  are  contrary 
to  Ohm's  law  and  demonstrate  that  short-term  phase  only 
speech  is  intelligible. 
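A frame-by-frame version of this processing can be sketched in Python as follows. The 512-point rectangular, non-overlapping frames follow the description of the experiments; the function name, the treatment of a trailing partial frame, and the guard constant are assumptions made for illustration.

```python
import numpy as np

def short_term_phase_only(signal, frame_len=512, eps=1e-12):
    """Set the DFT magnitude to one in each non-overlapping rectangular frame."""
    signal = np.asarray(signal, dtype=float)
    out = np.zeros(len(signal))
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        F = np.fft.fft(frame)
        unit = F / np.maximum(np.abs(F), eps)      # unit magnitude, phase unchanged
        out[start:start + frame_len] = np.real(np.fft.ifft(unit))
    return out  # a trailing partial frame, if any, is left as zeros (assumed choice)
```

At a 20 kHz sampling rate, frame_len = 512 corresponds to the 25.6 msec windows used in the experiments.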

B.  ANALYTIC  SIGNAL  PROCESSING 

The second experiment was a repeat of one carried out in
the late 1940s. Here the representation is an infinitely
clipped version of the original signal

$s_c(t) = \mathrm{sgn}[s(t)]$  (3.2)

where s(t) is the original signal, and sgn is defined to be
the sign of s(t). Thus the continuous-valued signal, s(t),
was transformed into a discrete-valued signal. The
transformation retains only the real-zero information of
s(t). That is, if s(t) was an analytic signal, the real
zeros mark the times when the phase changed by 180°. The
intelligibility of such a signal was not commented on by the
experimenters; however, they did say that large amounts of
speech information were retained using this transform.
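Infinite clipping as in equation (3.2) is simple to express in code. The Python sketch below is illustrative only; the function names are hypothetical, and mapping a sample that is exactly zero to +1 is an assumed convention.

```python
import numpy as np

def infinite_clip(signal):
    """Infinitely clipped signal: +1 where s >= 0 and -1 where s < 0."""
    return np.where(np.asarray(signal) >= 0, 1.0, -1.0)

def zero_crossings(signal):
    """Indices i at which the sign changes between sample i and sample i + 1."""
    clipped = infinite_clip(signal)
    return np.nonzero(np.diff(clipped) != 0)[0]
```

Because only the sign survives, scaling the input by any positive constant leaves both the clipped signal and the zero-crossing locations unchanged.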
C. DIRECT PHASE CEPSTRUM


The  concept  of  cepstral  analysis  of  speech  was  developed 
by  Oppenheim  [Ref.  16]  and  is  an  example  of  a  broad  class  of 
nonlinear  processing  called  homomorphic  processing.  These 
homomorphic  systems  obey  generalized  laws  of  superposition. 
If $x_1(n)$ and $x_2(n)$ are inputs to a homomorphic system,
$y_1(n)$ and $y_2(n)$ are the corresponding outputs, and k is any scalar,
then

$y_1(n) = \phi[x_1(n)]$
$y_2(n) = \phi[x_2(n)]$

$\phi[x_1(n) \,\triangle\, x_2(n)] = \phi[x_1(n)] \,\square\, \phi[x_2(n)]$

$\phi[k \circ x_1(n)] = k * y_1(n)$

where $\triangle$, $\square$, $\circ$, and $*$ are mathematical operations.

The importance of these homomorphic systems is that $\phi$
can be broken down into a cascade of operations as shown in
Figure 3.1, where $A_o$ and $A_o^{-1}$ are inverses of each other and L
is a simple linear filter.

Thus Oppenheim [Ref. 17] formulated a model for the
production of speech as shown in Figure 3.2. The model is
based on the assumption that the excitation and vocal tract
parameters are independent. The source of excitation for
the voiced sounds is the impulse generator whose period is
controlled by the pitch-period signal. The impulse
generator produces an impulse once every $N_0$ samples, where
$N_0$ is the pitch period and $1/N_0$ is the pitch frequency. The
unvoiced excitation is from the random number generator and
simulates both fricative and plosive sounds. The digital
filter is assumed to be slowly varying with time and hence
changes its coefficients once every 10 msec. The amplitude
control simply adjusts the output level of the speech.

Figure 3.2. Model for Speech Production

Using  this  model  the  output  digitized  speech  waveform 
consists  of  the  convolution  of 

(1)  The  train  of  impulses  representing  the  pitch 

(2)  The  excitation  pulse 

(3) The vocal tract impulse response.

If x(n) denotes the output signal, then

$x(n) = [p(n) * e(n) * u(n)]\, w(n)$  (3.3)

where p(n) is the train of pitch pulses, e(n) is the
excitation pulse, u(n) the vocal tract impulse response, and
w(n) the window through which the speech is viewed. The
window w(n) is smooth, hence we can define

$\hat{p}(n) = p(n)\, w(n)$  (3.4)

Then substituting this into equation (3.3) it is possible to
approximate x(n) by

$x(n) \approx \hat{p}(n) * e(n) * u(n)$  (3.5)


Examining equation (3.5), it is possible to convert the
triple convolution into a triple sum by first taking the
Fourier transform and then taking the logarithm. Processing
of this signal can be accomplished by a linear system, and
recovery of the waveform can be made by passing the
processed signal through an exponentiator followed by an inverse
Fourier transformer. Thus a homomorphic system for
processing speech has been developed, as shown in Figure 3.3
[Ref. 18].


Figure 3.3. Homomorphic System for Processing Speech
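The central step of the homomorphic system, turning a convolution into a sum, can be verified with a small Python sketch. The two random sequences are stand-ins for the convolved components and are purely illustrative. Only the log magnitude is used here, which is a simplification: a full homomorphic system would also carry the phase through the logarithm.

```python
import numpy as np

def log_magnitude(sequence, nfft):
    """Log of the magnitude of the zero-padded DFT of a sequence."""
    return np.log(np.abs(np.fft.fft(sequence, nfft)))

rng = np.random.default_rng(1)
a = rng.standard_normal(32)        # stand-in for one convolved component
b = rng.standard_normal(32)        # stand-in for another component
x = np.convolve(a, b)              # linear convolution, length 63

nfft = 64                          # at least len(a) + len(b) - 1, so no circular wrap-around

# Convolution in time -> product of transforms -> sum of log magnitudes.
lhs = log_magnitude(x, nfft)
rhs = log_magnitude(a, nfft) + log_magnitude(b, nfft)

# Exponentiating the summed log magnitudes recovers the magnitude spectrum of x.
recovered = np.exp(rhs)
```

Because the components add after the logarithm, a linear system operating on lhs can weight or remove one component without disturbing the other, which is the purpose of the linear block in the homomorphic system.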


Variations on this basic system have been developed to
estimate parameters of both the vocal tract transmission
functions and the excitation functions. One of these
variations involves making the assumption that the
excitation is $s(n) = \hat{p}(n) * e(n)$; then equation (3.5) can be
written as

$x(n) = u(n) * s(n)$  (3.6)

The system to process signals given by equation (3.6) is
shown in Figure 3.4 [Ref. 19].

Referring to Figure 3.4, the signal at A is x(n) and the
signal at D is called the cepstrum of x(n) and equals the
cepstrum of the excitation plus the cepstrum of the vocal
tract impulse response.


Figure 3.4. Cepstral Processing of Speech (block labels include Data Window and Cepstrum Window)


An important feature of the cepstrum at D is that it
separates the excitation from the vocal tract response. The
excitation is a sequence of quasi-periodic pulses; thus its
Fourier transform, at point B, is a line spectrum where the
lines are spaced at harmonics of the fundamental frequency.
The log magnitude operation does not affect the general
shape of the spectrum. The IDFT of the signal produces
another quasi-periodic waveform with pulses spaced at the
fundamental period. Thus the cepstrum of the excitation
should consist of pulses around n = 0, T, 2T, ..., where T
is the pitch period.

The DFT of the vocal tract response is a slowly varying
function of frequency. The log magnitude and IDFT yield a
sequence that is negligible after a few samples. The cepstrum
at D consists of two sequences, one which is negligible
after a few samples and one that is periodic. Thus the
cepstrum at D does differentiate the excitation from the
vocal tract parameters. The use of cepstral processing
has been extended into many diverse fields [Ref. 20].
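The separation of excitation and vocal tract in the cepstrum can be illustrated on a synthetic voiced frame. The Python sketch below is built on assumptions throughout: the sampling rate, pitch period, Hamming data window, single-resonance vocal tract response, and the quefrency search range are all illustrative choices.

```python
import numpy as np

def real_cepstrum(frame):
    """IDFT of the log magnitude of the DFT of a Hamming-windowed frame."""
    windowed = frame * np.hamming(len(frame))
    log_mag = np.log(np.abs(np.fft.fft(windowed)) + 1e-12)
    return np.real(np.fft.ifft(log_mag))

# Synthetic voiced frame (all values are illustrative assumptions).
fs = 10000                                  # sampling rate in Hz
pitch_period = 100                          # samples, i.e. a 100 Hz pitch
n = np.arange(1024)
excitation = np.zeros(len(n))
excitation[::pitch_period] = 1.0            # impulse train
k = np.arange(200)
vocal_tract = (0.95 ** k) * np.sin(2 * np.pi * 500 * k / fs)   # one damped resonance
frame = np.convolve(excitation, vocal_tract)[:len(n)]

cep = real_cepstrum(frame)

# The vocal tract contribution is concentrated at low quefrency, so the search
# for the pitch peak starts above it and stops before twice the pitch period.
low, high = 40, 160
estimated_period = low + int(np.argmax(cep[low:high]))
```

In this sketch the largest cepstral value in the search range is expected at, or within a sample or two of, the pitch period, while the low-quefrency samples describe the slowly varying vocal tract spectrum.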

For their third experiment, Cox and Robinson [Ref. 21]
modified Figure 3.4 by setting the magnitude of the signal
at point C equal to one. Hence the cepstrum at point D is
due only to the phase of the signal at A. What amount of
information and intelligibility does this phase-only
cepstrum contain? Surprisingly, the cepstrum was judged to
be very intelligible by listeners and the noise level was
reduced when compared with the short-term phase-only speech
(experiment number one).
D. INSTANTANEOUS PHASE OF THE ANALYTIC SIGNAL


The fourth experiment performed by Cox and Robinson [Ref. 22] was first performed in 1955 by Marcou and Daguet, who were looking for more efficient modulation techniques. They sought to use the analytic signal representation of a real signal s(t). Given a real signal s(t) that is Hilbert transformable, form the quadrature signal s*(t) and construct

$$m(t) = s(t) + j\,s^{*}(t) \tag{3.7}$$

From equation (3.7) it is possible to recover the original signal as

$$s(t) = \mathrm{Re}[m(t)] = |m(t)| \cos\theta(t) \tag{3.8}$$

where θ(t) is the phase of m(t). Equation (3.8) lets the real signal s(t) be represented by a magnitude and a phase.

The concept of an analytic signal, as equation (3.7) is called, was meaningless for discrete-time signals until Rabiner and Schafer [Ref. 23] developed a complex representation for real discrete-time bandpass signals.

Following the notation of Rabiner and Schafer, given a real sequence x(n) with Fourier transform X(ω), construct a complex sequence

$$\tilde{x}(n) = x(n) + j\,\hat{x}(n) \tag{3.9}$$

the Fourier transform of which is

$$\tilde{X}(\omega) = \begin{cases} 2X(\omega), & 0 < \omega < \pi \\ 0, & \pi \le \omega < 2\pi \end{cases} \tag{3.10}$$

From equation (3.9) the Fourier transform of $\tilde{x}(n)$ is

$$\tilde{X}(\omega) = X(\omega) + j\,\hat{X}(\omega)$$

and from equation (3.10) it follows that

$$X(\omega) + j\,\hat{X}(\omega) = \begin{cases} 2X(\omega), & 0 < \omega < \pi \\ 0, & \pi \le \omega < 2\pi \end{cases}$$

These requirements are satisfied if

$$\hat{X}(\omega) = H_d(\omega)\,X(\omega)$$

where

$$H_d(\omega) = \begin{cases} -j, & 0 < \omega < \pi \\ +j, & \pi < \omega < 2\pi \end{cases} \tag{3.13}$$
Thus, given any sequence x(n), it is possible to obtain the sequence $\hat{x}(n)$ by linear filtering of x(n) with a filter whose frequency response is given by equation (3.13). Such a filter is called an ideal Hilbert transformer, and $\hat{x}(n)$ is the Hilbert transform of x(n). The impulse response of the ideal Hilbert transformer is

$$h_d(n) = \begin{cases} \dfrac{2\sin^{2}(\pi n/2)}{\pi n}, & n \ne 0 \\ 0, & n = 0 \end{cases} \tag{3.14}$$

Examining equation (3.14), the impulse response is noncausal, of infinite duration, has odd symmetry, and all even-numbered samples are equal to zero (i.e., $h_d(2n) = 0$, $n = 0, \pm 1, \pm 2, \pm 3, \ldots$).
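The properties just listed can be checked directly from equation (3.14). The following minimal sketch evaluates the ideal impulse response over a small, arbitrarily chosen range of n; the function name `ideal_hilbert` and the range are illustrative assumptions.

```python
import math

def ideal_hilbert(n):
    """Impulse response of the ideal Hilbert transformer, equation (3.14)."""
    if n == 0:
        return 0.0
    return 2.0 * math.sin(math.pi * n / 2.0) ** 2 / (math.pi * n)

# Samples for n = -7, ..., 7; the true response extends to infinity.
taps = {n: ideal_hilbert(n) for n in range(-7, 8)}

for n in sorted(taps):
    print(f"h_d({n:+d}) = {taps[n]:+.6f}")
```

The even-numbered samples come out as zero to within rounding, and the odd-numbered samples equal 2/(πn), which makes the odd symmetry about n = 0 visible.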

Since infinite-length, noncausal impulse responses are not realizable, an FIR approximation is required. Given a causal FIR system whose impulse response is h(n), 0 ≤ n ≤ N-1, its frequency response is given by

$$H(\omega) = \sum_{n=0}^{N-1} h(n)\,e^{-j\omega n} \tag{3.15}$$

Equation (3.13) says the desired frequency response, $H_d(\omega)$, is purely imaginary. Thus, apart from a linear-phase delay factor, the real part of equation (3.15) must equal zero, h(n) being real. For this to hold, h(n) must satisfy the symmetry condition

$$h(n) = -h(N-1-n), \qquad n = 0, 1, \ldots, N-1 \tag{3.16}$$

If N is odd, h(n) has odd symmetry about n = (N-1)/2. If N is even, h(n) has odd symmetry about a point halfway between the samples at n = (N/2) - 1 and n = N/2. If equation (3.16) is satisfied, equation (3.15) can be written as

$$H(\omega) = e^{-j\omega(N-1)/2}\,\bigl[\,j\,H^{*}(\omega)\bigr] \tag{3.17}$$

where $H^{*}(\omega)$ is a real function of ω. If N is odd, $H^{*}(\omega)$ can be written as

$$H^{*}(\omega) = \sum_{n=1}^{(N-1)/2} a(n)\,\sin(\omega n) \tag{3.18}$$

where

$$a(n) = 2\,h\!\left(\frac{N-1}{2} - n\right), \qquad n = 1, 2, \ldots, \frac{N-1}{2} \tag{3.19}$$

Also, for N odd,

$$h\!\left(\frac{N-1}{2}\right) = 0 \tag{3.20}$$

For N even, equation (3.18) becomes

$$H^{*}(\omega) = \sum_{n=1}^{N/2} b(n)\,\sin\bigl[\omega(n - 1/2)\bigr] \tag{3.21}$$

where $b(n) = 2\,h(N/2 - n)$, $n = 1, \ldots, N/2$.

Examining equation (3.17) more closely, we find that the factor $e^{-j\omega(N-1)/2}$ is a delay of (N-1)/2 samples.
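The factorization in equation (3.17) can be confirmed numerically for a small odd-length case. The sketch below assumes an illustrative N = 7 filter obtained by truncating and delaying the ideal response of equation (3.14); this is not one of the filters used in the thesis, and all function names are hypothetical.

```python
import cmath
import math

def ideal_hilbert(n):
    """Ideal Hilbert transformer impulse response, equation (3.14)."""
    return 0.0 if n == 0 else 2.0 * math.sin(math.pi * n / 2.0) ** 2 / (math.pi * n)

N = 7                 # illustrative odd length (assumption)
M = (N - 1) // 2      # delay in samples
h = [ideal_hilbert(n - M) for n in range(N)]  # truncated and shifted to be causal

def H_direct(omega):
    """Frequency response evaluated from equation (3.15)."""
    return sum(h[n] * cmath.exp(-1j * omega * n) for n in range(N))

def H_star(omega):
    """Real function of equation (3.18), with a(n) from equation (3.19)."""
    return sum(2.0 * h[M - n] * math.sin(omega * n) for n in range(1, M + 1))

def H_factored(omega):
    """Equation (3.17): a delay of M samples times j H*(omega)."""
    return cmath.exp(-1j * omega * M) * 1j * H_star(omega)

# Symmetry condition of equation (3.16).
antisymmetric = all(abs(h[n] + h[N - 1 - n]) < 1e-12 for n in range(N))
# Largest difference between the two forms on a coarse frequency grid.
max_mismatch = max(abs(H_direct(0.1 * k) - H_factored(0.1 * k))
                   for k in range(1, 32))
```

With only seven weights H*(ω) is a crude approximation to -1, but the two expressions for H(ω) agree to rounding error, which is all the factorization claims.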

In finding an approximation to the ideal Hilbert transformer, the coefficients a(n) and b(n) were chosen in such a fashion that $jH^{*}(\omega)$ approximates the ideal frequency response given by equation (3.13). Thus $H^{*}(\omega)$ must approximate

$$D(\omega) = \begin{cases} -1, & 2\pi F_L \le \omega \le 2\pi F_H \\ +1, & 2\pi(1 - F_H) \le \omega \le 2\pi(1 - F_L) \end{cases} \tag{3.22}$$

where $F_L$ and $F_H$ are the lower and upper cutoff frequencies, represented as fractions of 2π. From equation (3.18), $H^{*}(\omega)$ must equal zero at ω = 0 and ω = π when N is odd, and from equation (3.21) it must equal zero at ω = 0 when N is even.

For the ideal transformer, the impulse response was zero for all even-numbered samples and the frequency response was imaginary, odd, periodic, and satisfied

$$H_d(\omega) = H_d(\pi - \omega).$$

For the FIR approximation similar properties must be valid. If N is odd and $F_L = .5 - F_H$, and assuming that

$$H^{*}(\omega) = H^{*}(\pi - \omega), \tag{3.23}$$

then substituting into equation (3.18) yields

$$\sum_{n=1}^{(N-1)/2} a(n)\sin(n\omega) = \sum_{n=1}^{(N-1)/2} a(n)\sin[(\pi - \omega)n] = \sum_{n=1}^{(N-1)/2} a(n)\,(-1)^{n+1}\sin(\omega n).$$

Rearranging terms,

$$\sum_{n=1}^{(N-1)/2} a(n)\,\bigl[1 - (-1)^{n+1}\bigr]\sin(\omega n) = 0.$$

Thus

$$a(n) = \begin{cases} 0, & n \text{ even} \\ \text{unconstrained}, & n \text{ odd.} \end{cases}$$

Combining this result with equations (3.16), (3.19), and (3.20), we have that when (N-1)/2 is even, h(n) = 0 for n = 0, 2, 4, ..., and when (N-1)/2 is odd, h(n) = 0 for n = 1, 3, 5, .... For the case of N even, no such relationship among the coefficients exists.

One important difference between even- and odd-length impulse responses can be seen in direct convolution. The convolution summation given by

$$\hat{x}(n) = \sum_{k=0}^{N-1} h(k)\,x(n-k)$$

involves only (N+1)/4 multiplications per output sample for N odd and N/2 multiplications for N even. The saving occurs because alternate values of h(n) are zero for N odd. Because of this saving, and for technical considerations, only Hilbert transformers of odd length are used.

In determining the values of h(n), Rabiner and Schafer [Ref. 24] used the Remez algorithm for the design of optimal FIR filters. The values of h(n) were calculated to minimize the peak approximation error, which is given by

$$G = \max_{2\pi F_L \le \omega \le 2\pi F_H} \bigl|D(\omega) - H^{*}(\omega)\bigr| \tag{3.24}$$

The Remez algorithm gives a Chebyshev, or equiripple, approximation to the desired response. Hence the error function is equiripple over the range $2\pi F_L \le \omega \le 2\pi F_H$. Given N, $F_L$, and $F_H$, the resulting approximation is best in the minimax sense.

Using this concept of an analytic signal representation for discrete-time signals, Cox and Robinson [Ref. 25] formed the analytic phase representation of a speech signal. Given a sampled speech signal, s(n), they calculated the Hilbert transform, s*(n), by the use of a 79-weight Hilbert transformer. Thus, having the analytic signal

$$m(n) = s(n) + j\,s^{*}(n)$$

the original signal s(n) is given by

$$s(n) = |m(n)| \cos\theta(n).$$

The analytic phase representation is given by cos θ(n). Thus, by way of a mathematical artifice, a real-valued sequence s(n) is represented as having magnitude and phase, with only the phase being retained. Contrary to common sense, perhaps, this analytic phase representation of speech was found to be intelligible. While these experiments by themselves do not prove that phase is a physical invariant of speech, they do indicate that more research is needed to determine what role phase plays in speech intelligibility.
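A minimal sketch of forming cos θ(n) follows. It assumes a stand-in filter, namely the ideal response of equation (3.14) truncated to 79 weights, rather than the equiripple weights used by Cox and Robinson, and it applies the filter to a synthetic amplitude-modulated tone instead of speech. The names `analytic_phase`, `FREQ`, and the test signal are hypothetical.

```python
import math

N = 79                # number of weights
M = (N - 1) // 2      # filter delay in samples

def ideal_hilbert(n):
    """Ideal Hilbert transformer impulse response, equation (3.14)."""
    return 0.0 if n == 0 else 2.0 * math.sin(math.pi * n / 2.0) ** 2 / (math.pi * n)

# Stand-in FIR transformer: truncated ideal response, delayed by M samples.
h = [ideal_hilbert(n - M) for n in range(N)]

def analytic_phase(s):
    """Return cos(theta) for every sample with a complete filter output."""
    out = []
    for n in range(N - 1, len(s)):
        quad = sum(h[k] * s[n - k] for k in range(N))  # Hilbert transform, delayed by M
        ref = s[n - M]                                 # original, delayed to match
        mag = math.hypot(ref, quad)                    # |m(n)|
        out.append(ref / mag if mag > 0.0 else 0.0)
    return out

FREQ = 0.125  # carrier frequency as a fraction of the sampling rate (assumed)
s = [(1.0 + 0.3 * math.cos(2.0 * math.pi * n / 128.0))
     * math.cos(2.0 * math.pi * FREQ * n) for n in range(400)]
phase_only = analytic_phase(s)
```

For this test signal the slow amplitude modulation is removed and `phase_only` stays close to a unit-amplitude cosine at the carrier frequency, which is the sense in which only the phase is retained.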

As was mentioned, a 79-weight Hilbert transformer was used in obtaining the analytic signal. Rabiner and Schafer [Ref. 26] calculated weights for three different values of peak approximation error and cutoff frequency. Table 3.1 lists these weights, and Figures 3.5 through 3.7 are plots of the magnitude of the frequency response. Table 3.1 lists only the even-numbered weights: since N = 79 gives (N-1)/2 = 39, which is odd, all odd-numbered weights are zero, and the weights have odd symmetry about n = 39.
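As a check on the table, the weights can be expanded into the full 79-point impulse response and substituted into equations (3.18) and (3.19). The sketch below uses the $F_L$ = .05 column of Table 3.1 and assumes the symmetric band edge $F_H$ = .5 - $F_L$ = .45; the grid density and variable names are illustrative choices.

```python
import math

# Weights h(0), h(2), ..., h(38) from the F_L = .05 column of Table 3.1.
EVEN_WEIGHTS = [
    -.0000041, -.0000179, -.0000550, -.0001389, -.0003074,
    -.0006182, -.0011532, -.0020239, -.0033761, -.0053956,
    -.0083167, -.0124372, -.0181511, -.0260178, -.0369200,
    -.0524475, -.0759556, -.1161821, -.2053402, -.6343000,
]
N = 79
M = (N - 1) // 2  # center of odd symmetry, n = 39

h = [0.0] * N     # odd-numbered weights and h(39) remain zero
for i, w in enumerate(EVEN_WEIGHTS):
    n = 2 * i
    h[n] = w
    h[N - 1 - n] = -w   # odd symmetry, equation (3.16)

def h_star(omega):
    """H*(omega) from equations (3.18) and (3.19)."""
    return sum(2.0 * h[M - n] * math.sin(omega * n) for n in range(1, M + 1))

F_L, F_H = 0.05, 0.45   # F_H assumed from the symmetric case F_L = .5 - F_H
grid = [F_L + (F_H - F_L) * i / 400 for i in range(401)]
# Deviation from the desired value D = -1 over the passband, cf. equation (3.24).
peak_error = max(abs(-1.0 - h_star(2.0 * math.pi * f)) for f in grid)
print(f"peak error on grid: {peak_error:.2e}")
```

Only 40 of the 79 weights are nonzero, and because they occur in equal-magnitude pairs, a direct convolution needs (N+1)/4 = 20 multiplications per output sample.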
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TABLE  3.1 


HILBERT TRANSFORMER WEIGHTS
N = 79

        F_L = .01       F_L = .02       F_L = .05
  n     G = .0388830    G = .0024390    G = .0000010

  0     -.0229388       -.0019358       -.0000041
  2     -.0075151       -.0017746       -.0000179
  4     -.0087784       -.0025624       -.0000550
  6     -.0101565       -.0035600       -.0001389
  8     -.0117808       -.0048021       -.0003074
 10     -.0135612       -.0063300       -.0006182
 12     -.0155902       -.0081910       -.0011532
 14     -.0179182       -.0104453       -.0020239
 16     -.0206260       -.0131630       -.0033761
 18     -.0237742       -.0164470       -.0053956
 20     -.0274953       -.0204251       -.0083167
 22     -.0319865       -.0252943       -.0124372
 24     -.0375627       -.0313515       -.0181511
 26     -.0447012       -.0390711       -.0260178
 28     -.0542333       -.0492818       -.0369200
 30     -.0677331       -.0635544       -.0524475
 32     -.0885965       -.0852651       -.0759556
 34     -.1256401       -.1232135       -.1161821
 36     -.2111964       -.2097186       -.2053402
 38     -.6362830       -.6357869       -.6343000


Figure 3.5. Frequency Response of Hilbert Transformer
N = 79, F_L = .01, G = .0388830
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IV.  EXPERIMENTAL  PROCEDURE 


This thesis extends the work of Cox and Robinson to the isolated word recognition field. Specifically, using the homomorphic and analytic signal processing techniques employed in experiments three and four, an isolated word recognition system is developed.

A.  DATA  ACQUISITION 

In order to form a data base for use by the system, twenty volunteers were recruited to record the digits zero through nine. Each participant was given a questionnaire/instruction sheet like that contained in Appendix A. All
speakers  were  males  between  the  ages  of  25  and  35  and  all 
were  native  English  speakers.  Their  places  of  birth  varied 
from  eastern  Pennsylvania  to  southern  Tennessee.  Ten  of 
these  speakers  were  selected  to  form  the  data  base  or 
pattern  base  of  the  system.  The  other  ten  speakers  were 
used  to  test  the  system. 

The  speech  was  recorded  on  an  analog  tape  recorder  with 
all  recordings  being  done  in  the  Speech  Processing 
Laboratory.  The  recordings  were  done  in  the  late  afternoon 
or  in  the  evening  when  the  ambient  noise  level  was  at  a 
minimum.  The  tape  recorder  used  was  the  HP-3964A  reel-to- 
reel  instrumentation  recorder  running  at  7.5  ips  using  AMPEX 
professional  audio  tape. 
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Before  this  analog  speech  could  be  digitized  an 
appropriate bandwidth and sampling rate had to be determined. The power spectral density of each digit was computed and averaged over ten utterances of the digit. The majority of the power was found to be below 3 KHz, except in the case of the digit 'six', where non-negligible power was found at frequencies up to 6 KHz. A cutoff frequency of 4 KHz was chosen, which is exactly half the bandwidth that Cox and Robinson used. As will be explained later, once the bandwidth is fixed the sampling rate is also fixed. In this case the sampling rate is fixed at 10.24 KHz (10,240 samples/sec).

The  machine  used  to  digitize  the  speech  was  the  GENRAD 
2505  Signal  Analysis  System  [Ref.  27].  The  system  is  a 
narrowband  (0  -  25  KHz)  signal  analysis  system  originally 
designed  for  vibrational  analysis  studies.  The  system  uses 
a  DEC  PDP  11/34A  as  the  host  computer  and  supports  two 
channels  of  A/D  conversion. 

The heart of the system's software is GENRAD's Time Series Language (TSL), which allows the operator to control the A/D converter. TSL is an interpretive language which uses commands similar to BASIC. The TSL program 'ANADSK' is the routine that provides analog input to disk storage.

Given a bandwidth, the 'ANADSK' routine sets the sampling rate at 2.56 times the highest frequency component to prevent aliasing. The system provides for high-speed continuous sampling and writes the digitized data to the system's Winchester disks in 2048-byte blocks.

The two-channel A/D converter has, per channel, two 6-pole Chebyshev filters in cascade as anti-aliasing filters, each with 96 dB/octave rolloff above cutoff. The A/D converter is a 2 nsec converter with a 12-bit output.

Once the speech was digitized, a time window for the sampled data had to be determined. Referring again to the utterances whose power spectral densities were computed, the average length of the utterances was 740 msec. In order for the mathematics to work out nicely, a 750-msec window was chosen.

Using TSL library routines 'RTIO' and 'XDISPL', a routine was written that displayed the digitized data on the system's CRT. The program graphically displayed 1024 samples at a time and allowed the operator to select any 256 samples for transfer to the W. R. Church Computer Center's IBM 3033 for processing. This transfer was via a 1200 baud modem. With the capability to view the data prior to transfer, the start of the utterance could be selected to within 128 samples. Since the time window was selected to be 750 msec and the speech was sampled at 10,240 samples/sec, 7680 points needed to be transferred. Thus thirty blocks of 256 samples each were transferred per utterance.
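The figures quoted above follow from simple arithmetic, sketched here using only values stated in the text; the constant names are illustrative.

```python
BANDWIDTH_HZ = 4000        # chosen cutoff frequency
RATE_FACTOR = 2.56         # sampling-rate multiplier applied by 'ANADSK'
WINDOW_SEC = 0.750         # time window per utterance
BLOCK_SAMPLES = 256        # samples transferred per block
START_TOLERANCE = 128      # start of utterance selectable to within this many samples

sample_rate = round(RATE_FACTOR * BANDWIDTH_HZ)       # samples per second
window_samples = round(sample_rate * WINDOW_SEC)      # points per utterance
blocks = window_samples // BLOCK_SAMPLES              # blocks per utterance
start_tolerance_ms = 1000.0 * START_TOLERANCE / sample_rate

print(sample_rate, window_samples, blocks, start_tolerance_ms)
```

The window length is an exact multiple of the block size, which is presumably why 750 msec let "the mathematics work out nicely"; the 128-sample start tolerance corresponds to 12.5 msec.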


The transfer/interface program between the Speech Lab's PDP 11/34A and the IBM 3033 was written by LT Jay H. Benson. A copy of his program, 'CATCH', is included in Appendix B. The transfer of data via the modem was very time consuming because, for technical reasons, each sample, which occupied two bytes on the PDP 11/34A, was made into a four-byte number for transfer. The sixteen most significant bits were then masked off prior to storage on the IBM system. In order to minimize the amount of disk storage required, the data was written to the disk using an unformatted FORTRAN write statement, using INTEGER*2 numbers. Even using this scheme to maximize storage efficiency, 24 cylinders plus magnetic tape backup were required to store the data.

B.  DATA  PROCESSING 

The  decision  to  use  the  IBM  system  to  process  the  data 
was  based  on  the  availability  of  library  routines  (e.g., 
IMSL,  NONIMSL),  the  DISSPLA  graphics  package,  and  the  full 
screen  text  editor.  All  programs  in  Appendix  B  were  written 
in  FORTRAN  H. 

The first task was to compute an average waveform for each speaker. In order to accomplish this, three of the four utterances of each digit by each of the 10 speakers were averaged together. The program 'MEANS' was used to compute this average. The technique is simple and straightforward, as the ensemble mean was computed. This agrees with the work done by the Air Force [Ref. 28], where they assumed that the samples are statistically independent, identically distributed Gaussian random variables. This is an oversimplification, as it is known that the vocal tract is slowly varying, with the tract parameters changing only every 10 msec.
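The ensemble mean amounts to a sample-by-sample average across utterances. The following sketch is not the 'MEANS' program; it assumes the utterances are already aligned and of equal length, and the function name is hypothetical.

```python
def ensemble_mean(utterances):
    """Average equal-length utterances sample by sample."""
    length = len(utterances[0])
    if any(len(u) != length for u in utterances):
        raise ValueError("all utterances must have the same number of samples")
    count = len(utterances)
    return [sum(u[n] for u in utterances) / count for n in range(length)]

# Three short made-up "utterances" standing in for 7680-point waveforms.
average = ensemble_mean([[1, 2, 3], [3, 2, 1], [2, 2, 2]])
```

Because no time alignment is applied, any difference in the start point or speaking rate between utterances is averaged into the pattern.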

The short-term cepstral representation of the averaged waveform was computed using the program 'CEP'. In keeping with Cox and Robinson, the waveform was segmented into 25-msec parts and each part was processed in sequence.

Finally, the analytic signal representation of the waveform was computed using an FIR Hilbert transformer with 79 weights and a lower cutoff frequency of .05. The frequency response of this filter is shown in Figure 3.7. This particular filter was chosen over the other two 79-weight filters because of its very small approximation error. The small approximation error does imply that the transition band of this filter is wider than those of the other two filters; however, this was deemed less important than the peak approximation error.

Examples of these three representations of the same utterances can be found in Figures 4.1 through 4.30. These examples are of a 30-year-old male, born and raised in eastern Pennsylvania, and a Naval cryptologic officer. In order to display all 7680 points on one graph, the waveform was first normalized, then divided into four 1920-point parts. Each part was biased by (N-1) * 2, where N = 1, 2, 3, 4, to permit graphing of the four segments on one page. The graphs should be read from left to right, top to bottom.

C.  DECISION  ALGORITHM 

Once the speech had been processed, a decision algorithm had to be formulated to classify utterances based on the patterns collected. All of the isolated word recognizers use a form of classical pattern recognition to classify utterances. The VRM system uses a nearest neighbor algorithm with a variable threshold. If no utterance is within the distance specified by the threshold, an "unable to classify" message is issued.

The nearest neighbor rule is an example of the pooled form of the k-nearest neighbor rule [Ref. 29]. For the two-class case, a hypersphere is formed around the vector x to include k total samples regardless of their class. Thus k_1 + k_2 = k, where k_i is the number of vectors belonging to class i. The quotient k_1/k_2 is formed and compared to one. If k_1/k_2 > 1, there are more class one vectors in the hypersphere around x, and the vector x is said to belong to class one. If the converse of the inequality is true, k_1/k_2 < 1, then x is said to belong to class two. The probability of error for the case k = 1 is less than twice the minimum probability of error for any decision rule.
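The pooled two-class rule can be sketched as follows. This is an illustration under stated assumptions: Euclidean distance is used, the counts k_1 and k_2 are compared directly rather than dividing (which avoids a zero denominator), and a tie, which the text does not address, returns no decision. The function name and sample data are hypothetical.

```python
import math

def pooled_knn(x, samples, k):
    """Classify x as class 1 or 2 from its k nearest labeled samples.

    samples is a list of (class_label, vector) pairs with labels 1 or 2.
    Returns 1, 2, or None when k_1 equals k_2.
    """
    nearest = sorted(samples, key=lambda item: math.dist(x, item[1]))[:k]
    k1 = sum(1 for label, _ in nearest if label == 1)
    k2 = len(nearest) - k1
    if k1 > k2:          # equivalent to k_1 / k_2 > 1
        return 1
    if k2 > k1:          # equivalent to k_1 / k_2 < 1
        return 2
    return None

samples = [(1, [0, 0]), (1, [0, 1]), (1, [1, 0]), (2, [5, 5]), (2, [6, 5])]
decision = pooled_knn([0.2, 0.2], samples, k=3)
```

With k = 1 the rule reduces to assigning x the class of its single nearest sample.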
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The  nearest  neighbor  rule  was  employed  to  classify  the 
utterances. Using the program 'DEC', the Euclidean distance between a test vector and the stored patterns was computed. The results of this pattern matching are discussed in the next chapter.
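A minimal sketch of nearest neighbor matching with the Euclidean distance is given below. It is not the 'DEC' program: the optional threshold mirrors the VRM behavior described earlier and is included only as an assumption, and the pattern labels and vectors are made up.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbor(test_vector, patterns, threshold=None):
    """Return (label, distance) of the closest stored pattern.

    patterns is a list of (label, vector) pairs. If a threshold is given
    and the closest pattern lies beyond it, the label is None
    (unable to classify).
    """
    label, distance = min(
        ((lab, euclidean(test_vector, vec)) for lab, vec in patterns),
        key=lambda pair: pair[1],
    )
    if threshold is not None and distance > threshold:
        return None, distance
    return label, distance

patterns = [("zero", [0.0, 0.0]), ("one", [1.0, 1.0])]
label, distance = nearest_neighbor([0.9, 1.2], patterns)
```

In the system described here each vector would hold all 7680 samples of a representation, so the cost of one classification grows with both the vector length and the number of stored patterns.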


Figure 4.2. Analytic Representation of Zero

Figure 4.3. Cepstral Representation of Zero

[Plot; axes: Sample N, Normalized Voltage]

Figure 4.8. Analytic Representation of Two
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Figure  4.15.  Cepstral  Representation  of  Four 
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Figure  4.17.  Analytic  Representation  of  Five 
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Figure  4.28.  Sampled  Waveform,  Nine 
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V.  RESULTS  AND  CONCLUSIONS 


Ten  speakers  were  selected  to  form  the  data  base  for  the 
system.  Their  utterances  were  processed  to  obtain  both 
their  cepstral  and  analytic  phase  representations.  The 
system  was  then  tested  using  two  groups  of  speakers.  The 
first  group,  denoted  Group  A,  consisted  of  speakers  whose 
utterances  were  used  to  form  the  data  base.  Each  speaker 
repeated  the  digits  four  times,  and  only  three  of  these 
utterances  were  used  to  compute  the  average  waveform  and 
hence  the  cepstral  and  analytic  phase  representations. 

Group A can be thought of as having trained the system. The second group, Group B, consists of the other ten speakers.

The  system  was  tested  using  ten  utterances  per  digit 
from  each  of  the  two  groups  of  speakers.  The  reference 
pattern  space  was  varied,  using  three  different  spaces  each 
containing  100  patterns.  The  cepstral  and  analytic 
representations  formed  two  of  the  reference  spaces,  while 
the  unprocessed  signals  formed  the  third  space.  Tables  5.1 
and  5.2  contain  the  results  of  the  test. 

The results for Group A, in all categories, are below the results attainable with the VRM system. For three training passes the VRM system has a 97% recognition rate. The high percentage of recognition for the unprocessed waveforms was to be expected, since the speakers trained the system and the pattern space consisted of the average of each speaker's utterances. For the unprocessed waveforms, the distances between the pattern vectors and the test vector were of the same magnitude regardless of whether the utterance was correctly identified or not. In the case of the short-term phase representations, when the system correctly identified an utterance, the distance between the test vector and its nearest neighbor was an order of magnitude less than all the other distances. When the system incorrectly identified an utterance, all distances were of the same magnitude.

The success demonstrated in the speaker-dependent case is not without cost. Compared to the VRM system, which has at most 120 bits/pattern, this system has 122.88K bits/pattern (7680 two-byte numbers). There was also an extensive amount of manual editing involved in obtaining these patterns, on the order of ten minutes per utterance. However, it was shown that short-term phase-only speech can be used to construct a speaker-dependent isolated word recognizer.

The results for Group B appear to be abysmal; however, several things must be considered. First, there was no preprocessing of the signals to time-warp them. Second, no features were extracted; only the entire waveforms were used. Third, the decision algorithm may have to be tailored to fit the data, rather than using a general-purpose decision rule. Last, but certainly not least, no system exists today that is completely speaker independent.

One final observation concerns Group B. When the decision algorithm incorrectly identified an utterance, it did so with a great deal of bias: in 30% of the cases where an utterance was incorrectly identified, the number 'one' was picked to be the nearest neighbor.

This thesis was not an attempt to definitively answer the question, "Is phase a physical invariant of speech?" Its purpose was to show that phase should be considered when constructing a word recognition system. This was accomplished. The next step is to use the information obtained from the phase in conjunction with other word recognition systems to possibly improve these systems, with the long-range goal of solving the speaker-independent word recognition problem.
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GROUP  A  RECOGNITION  RESULTS 
BASED ON TEN UTTERANCES PER DIGIT

          Unprocessed    Cepstral          Analytic
Digits    Waveforms      Representation    Representation

0          9             7                 6
1         10             8                 7
2         10             6                 4
3          9             5                 3
4         10             5                 5
5         10             7                 5
6          8             4                 3
7          9             4                 3
8         10             6                 7
9         10             7                 6

AVG        9.5           5.9               4.9

TABLE  5.2 

GROUP  B  RECOGNITION  RESULTS 
BASED  ON  TEN  UTTERANCES  PER  DIGIT 
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Representation 

Representation 
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1 

1 
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3 

3 

2          1             1                 0
3          2             0                 0
4          1             0                 1
5          1             0                 0
6          0             0                 0
7          1             0                 0
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0 
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APPENDIX  A 


INSTRUCTION  SHEET 


Thank  you  for  participating  in  the  Speech  Processing 
Lab’s  effort  to  collect  speech  samples.  This  exercise  will 
require  about  10  minutes  of  your  time  to  complete. 

I.  Biographical  Data 

A.  Name: 

B.  Age: 

C.  Sex: 

D.  Place  of  Birth: 

E.  Occupation: 

II.  Speech  Sampling 

A. Repeat each word on the list four times, pausing approximately 5 sec. between utterances. (For example: the first word on the list is 'zero', therefore you would say: 'zero' (pause) 'zero' (pause) 'zero' (pause) 'zero' (pause) 'one' (pause) ...)


zero      six
one       seven
two       eight
three     nine
four
five

B.  Repeat  the  following  exercise  3  times: 

Read the entire list of numbers at your natural speaking rate, pausing approx. 5 secs. before repeating the list. Do not pause unnaturally between the numbers. We are looking for continuous speech such as in a conversation.


zero-one-two-three-four-five-six-seven-eight-nine (pause/repeat)


APPENDIX  B 
COMPUTER  PROGRAMS 

All programs were written in IBM FORTRAN H to run on the W. R. Church Computer Center's IBM 3033. The programs access routines from the IMSL library. The graphics programs interact with the DISSPLA graphics package.
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